Sains Malaysiana 52(9)(2023): 2725-2732
http://doi.org/10.17576/jsm-2023-5209-20
Statistical Methods for Finding Outliers in Multivariate Data using a Boxplot and Multiple Linear Regression
THEERAPHAT THANWISET & WUTTICHAI SRISODAPHOL*
Department of Statistics, Khon Kaen University, 40002 Khon Kaen, Thailand
Received: 1 December 2022 / Accepted: 15 August 2023
Abstract
The objective of this study was to propose a method for detecting outliers in multivariate data based on a boxplot and multiple linear regression. In the proposed method, the boxplot is first applied to every variable to split the data set into two subsets: normal data (observations lying within the lower and upper fences of the boxplot) and data that may be outliers. The normal data are then used to fit a multiple linear regression model, and the maximum absolute residual of this fit is taken as the cut-off point. To evaluate the performance of the proposed method, a simulation study was conducted on multivariate normal data, with and without contamination, at various contamination levels. The proposed method was compared with existing methods, namely the Mahalanobis distance and the Mahalanobis distance with robust estimators obtained by the minimum volume ellipsoid, minimum covariance determinant, and minimum vector variance methods. The results showed that the proposed method outperformed the compared methods at all contamination levels. When applied to real data, the proposed method also identified outliers consistent with the data.
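The detection procedure described above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: one designated variable is treated as the regression response and the remaining variables as predictors, the standard Tukey fences (1.5 × IQR) are used, and the function names are hypothetical.

```python
import numpy as np

def boxplot_fences(x, k=1.5):
    """Tukey fences [Q1 - k*IQR, Q3 + k*IQR] for one variable."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def detect_outliers(X, y):
    """Boxplot-then-regression outlier detection.

    1) Keep only observations lying within the boxplot fences of
       every variable (the 'normal' set).
    2) Fit a multiple linear regression on the normal set.
    3) Use the maximum absolute residual of the normal set as the
       cut-off; flag any observation whose residual exceeds it.
    """
    data = np.column_stack([X, y])
    normal = np.ones(len(y), dtype=bool)
    for j in range(data.shape[1]):
        lo, hi = boxplot_fences(data[:, j])
        normal &= (data[:, j] >= lo) & (data[:, j] <= hi)

    # Least-squares fit (intercept + predictors) on the normal set.
    A = np.column_stack([np.ones(normal.sum()), X[normal]])
    beta, *_ = np.linalg.lstsq(A, y[normal], rcond=None)
    cutoff = np.abs(y[normal] - A @ beta).max()

    # Residuals of all observations against the fitted model.
    A_all = np.column_stack([np.ones(len(y)), X])
    return np.abs(y - A_all @ beta) > cutoff
```

By construction, observations kept by the boxplot filter are never flagged (their residuals are at most the cut-off); only points outside the fences whose response departs strongly from the fitted model are declared outliers.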
Keywords: Boxplot; multivariate data; multiple linear regression; outlier
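As a point of reference for the comparison methods named above, the classical (non-robust) Mahalanobis-distance rule flags an observation when its squared distance from the sample mean exceeds a chi-square quantile. A brief sketch, assuming a 97.5% chi-square cut-off (the cut-off used in the study's simulations is not specified here):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.025):
    """Flag rows whose squared Mahalanobis distance from the sample
    mean exceeds the chi-square(p) quantile at level 1 - alpha."""
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Squared distances d_i^2 = (x_i - mean)^T S^{-1} (x_i - mean)
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return d2 > chi2.ppf(1 - alpha, df=X.shape[1])
```

Because the sample mean and covariance are themselves distorted by outliers, the robust variants compared in the study replace them with MVE, MCD, or MVV estimates.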
REFERENCES
Aelst, S.V. & Rousseeuw, P. 2009. Minimum volume ellipsoid. WIREs Computational Statistics 1: 71-82.
Anscombe, F.J. & Guttman, I. 1960. Rejection of outliers. Technometrics 2(2): 123-147.
Belsley, D.A., Kuh, E. & Welsch, R.E. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons.
Cabana, E., Lillo, R.E. & Laniado, H. 2021. Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators. Statistical Papers 62: 1583-1609.
Cook, R.D. 1977. Detection of influential observations in regression. Technometrics 19: 15-18.
Herdiani, E.T., Sari, P.P. & Sunusi, N. 2019. Detection of outliers in multivariate data using minimum vector variance method. Journal of Physics: Conference Series 1341(9): 092004.
Hoaglin, D.C. & Welsch, R.E. 1978. The hat matrix in regression and ANOVA. The American Statistician 32: 17-22.
Hubert, M. & Debruyne, M. 2010. Minimum covariance determinant. WIREs Computational Statistics 2: 36-43.
Lichtinghagen, R., Klawonn, F. & Hoffmann, G. 2020. UCI Machine Learning Repository. Irvine: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/datasets/HCV+data
Mahalanobis, P.C. 1936. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India 2(1): 49-55.
Montgomery, D.C., Peck, E.A. & Vining, G.G. 2012. Introduction to Linear Regression Analysis. 3rd ed. New York: John Wiley & Sons.
Tukey, J.W. 1977. Exploratory Data Analysis. Massachusetts: Addison-Wesley.
*Corresponding author; email: wuttsr@kku.ac.th